Over the last three decades, many studies have collected molecular and clinical data from individuals with particular medical conditions (a specific cancer type, for instance) in order to describe the molecular profile of these conditions and better understand how they influence clinical outcomes.
The anonymized data is available under either restricted or public access. To access the raw data (sequencing reads), you will need the authorization of a data access committee. To access processed data (e.g. gene expression tables), you don’t need any authorization. It can nevertheless be difficult to retrieve this data by hand and then load it into R.
3.2 Load expression data from the TCGA
For the purposes of this notebook, we have defined a function that loads the data you need from a list of genes and cancer types. The processed data comes from cBioPortal, a reference website widely used for exploring and downloading the data used in research papers.
For that you will simply need to load .RDS objects in which the data has already been downloaded for you. For the next steps we will be using several R libraries. Load them with the following chunk.
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(glmnet))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(yarrr))
suppressPackageStartupMessages(library(knitr))
The next chunk defines two functions that will be useful for visualizing the results of your classification models. An R function may take as input one or multiple arguments that you may change depending on the task you want to perform. Execute the chunk in order to load these functions into your R environment.
getColors <- function(vec, pal="Dark2", alpha=0.7){
  colors <- list()
  palette_predefined <- read.csv("../data/colors.csv")
  palette_predefined <- setNames(palette_predefined$Color, palette_predefined$Name)
  palette_default <- brewer.pal(max(length(unique(vec)), 3), pal)
  i <- 1
  for (v in unique(vec)){
    if (toupper(v) %in% names(palette_predefined)){
      colors[[v]] <- adjustcolor(palette_predefined[[toupper(v)]], alpha)
    } else {
      colors[[v]] <- adjustcolor(palette_default[[i]], alpha)
      i <- i + 1
    }
  }
  colors
}
getConfusionMatrix <- function(labelsPredicted, labelsCorrect, labels=NULL){
  if (is.null(labels)){
    labels <- unique(union(labelsPredicted, labelsCorrect))
  } else {
    labels <- unique(labels)
  }
  confMat <- data.frame(row.names=labels)
  for (labelPredicted in labels){
    for (labelCorrect in labels){
      confMat[labelPredicted, labelCorrect] <- sum(labelsPredicted==labelPredicted & labelsCorrect==labelCorrect)
    }
  }
  confMat
}
plot_coefs_logistic_regression <- function(nCoeffsPlot=20, CoeffsA, CoeffsB, colorA, colorB){
  ymax <- max(abs(CoeffsA)[1], abs(CoeffsB)[1])
  ymax <- ceiling(ymax/10**(round(log10(ymax))))*10**(round(log10(ymax)))
  xx <- barplot(height=c(CoeffsA[1:nCoeffsPlot], CoeffsB[1:nCoeffsPlot]),
                col=c(rep(colorA, nCoeffsPlot), rep(colorB, nCoeffsPlot)),
                cex.names=0.7, las=2, ylim=c(-ymax, ymax))
  text(xx[1:nCoeffsPlot]+1,
       y=-1,
       label=rownames(CoeffsA)[1:nCoeffsPlot],
       pos=2,
       cex=0.7,
       srt=90)
  text(xx[(nCoeffsPlot+1):(2*nCoeffsPlot)]-0.75,
       y=1,
       label=rownames(CoeffsB)[1:nCoeffsPlot],
       pos=4,
       cex=0.7,
       srt=90)
}
INCLASS WORK Load the list of genes from the Cancer Gene Census file ../data/CancerGeneCensusCOSMIC_20240117.csv, then load the normalized RNA expression profiles for the BLCA and LUSC TCGA PanCancer Atlas 2018 studies by loading the blca_lusc_exp.RDS object.
More specifically:
- Use the read.csv function (a base R function, so no library needs to be loaded) to load the Cancer Gene Census file ../data/CancerGeneCensusCOSMIC_20240117.csv and extract the list of gene symbols into a variable CgcGenes.
- Use the readRDS function to load the data into a TcgaData list variable.
- Explore the contents of the list.
# YOUR WORK HERE
CgcTable <- read.csv("../data/CancerGeneCensusCOSMIC_20240117.csv")
CgcGenes <- CgcTable$Gene.Symbol
TcgaData <- readRDS("../data/blca_lusc_exp.RDS")
# TcgaData <- LoadcBioportal(StudyId="blca_tcga_pan_can_atlas_2018|lusc_tcga_pan_can_atlas_2018",
# Genes=CgcGenes,
# GenesType="hugoGeneSymbol",
# ClinicNeeded=T,
# MutNeeded=F,
# RNANeeded=T,
# NormalizeRNA=T)
Observe that mutation data were not added to this list: TcgaData$MUT has dimensions [0,0]. The TcgaData$STUDY table tells you how many patients have been loaded into each table of each study. It may happen that some data modalities (clinical, expression, mutation) are not available for all patients.
INCLASS WORK Intersect the list of patients between the EXP and CLINIC data modalities to select only patients having both clinical data and expression profiles. Then subset and align the two tables on this common list. You will need the intersect R function for this task.
# YOUR WORK HERE
patients_by_modality <- list(row.names(TcgaData$CLINIC), row.names(TcgaData$EXP))
patients_common <- Reduce(intersect, patients_by_modality)
TcgaData$CLINIC <- TcgaData$CLINIC[patients_common,]
TcgaData$EXP <- TcgaData$EXP[patients_common,]
We now have TcgaData$EXP and TcgaData$CLINIC data tables with the same patients as rows and no NAs.
3.3 Description of the data
The first very important step is to visualize and describe your data. It can be tricky when you have huge multidimensional matrices.
For continuous data such as RNA-seq, simply plotting the distributions is a good first step. The expression profiles downloaded from cBioPortal were normalized using a log-min-max normalization: if \(X\) is the original profile (one value per gene), then \(Y=\log_{10}(X+1)\) and \(Z=\frac{Y-\min(Y)}{\max(Y)-\min(Y)}\). The following chunk displays the distribution of normalized gene expression across patients for nine randomly selected genes in BLCA and LUSC.
set.seed(1995)
colorsStudy <- getColors(TcgaData$CLINIC$study, alpha=1)
genesExp <- colnames(TcgaData$EXP)
genesHist <- sort(sample(genesExp, size=9, replace=F))
par(mfrow=c(3,3), mai=c(0.35,0.4,0.35,0.4))
for (geneHist in genesHist){
  mask_blca <- TcgaData$CLINIC$study=="blca"
  mask_lusc <- TcgaData$CLINIC$study=="lusc"
  DataGene <- rbind(data.frame(Expression=TcgaData$EXP[mask_blca,geneHist], Tumor="blca"),
                    data.frame(Expression=TcgaData$EXP[mask_lusc,geneHist], Tumor="lusc"))
  pirateplot(formula=Expression ~ Tumor,
             data=DataGene,
             theme=1,
             quant=NULL,
             pal=colorsStudy,
             main=geneHist,
             xlab="")
}
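The log-min-max transformation described above can be written as a small helper function. This is only an illustrative sketch (the actual normalization was applied upstream, and the name log_min_max is ours):

```r
# Sketch of the log-min-max normalization of one expression profile:
# Y = log10(X + 1), then Z = (Y - min(Y)) / (max(Y) - min(Y))
log_min_max <- function(x){
  y <- log10(x + 1)
  (y - min(y)) / (max(y) - min(y))
}

log_min_max(c(0, 9, 99, 999))
# all values now lie in [0, 1], with the minimum mapped to 0 and the maximum to 1
```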
A very popular visualization tool for data matrices is the heatmap. However, you should only draw a heatmap if you do not have too many samples or too many variables. The heatmap base function additionally allows you to perform hierarchical clustering of the rows and/or the columns to highlight clusters of variables and/or samples.
rowNames <- rownames(TcgaData$EXP)
rowStudy <- TcgaData$CLINIC[rowNames, "study"]
rowSideColors <- sapply(rowStudy, function(x) colorsStudy[[x]])
heatmap(as.matrix(TcgaData$EXP), Rowv=NULL, Colv=NULL, keep.dendro=F, RowSideColors=rowSideColors)
INCLASS_WORK Answer the numbered questions throughout the notebook.
QUESTION 1) What are TPM and RPKM? The log-min-max normalization was applied to the mRNAseq v2 molecular profiles from cBioPortal, which correspond to the rsem.genes.normalized_results files from TCGA, as explained in the documentation [here](https://docs.cbioportal.org/user-guide/faq/#how-is-tcga-rnaseqv2-processed-what-units-are-used). Are the expression profiles in the TcgaData$EXP table modified (log-min-max normalized) TPM values? RPKM values?
Hint: To understand what rsem.genes.normalized_results TCGA files are, read the answer to the following post https://www.biostars.org/p/106127/.
ANSWER:
Transcripts Per Million (TPM) and Reads Per Kilobase per Million mapped reads (RPKM) are commonly used normalized measures of gene expression. Both metrics account for the library size (the total number of reads in a given sample) and for gene length (the longer the gene, the more reads align to it, as reads are of fixed size). The difference between TPM and RPKM lies in the order of the normalization by library size and by gene length: RPKM normalizes first by library size and then by gene length, while TPM performs the normalizations in the reverse order. As a consequence, the TPM values of all genes of a given sample always sum to 1 million, which is not true for RPKM. The rsem.genes.normalized_results files contain RSEM expected counts normalized by the upper quartile of the count distribution, so the values in TcgaData$EXP are log-min-max transformed RSEM normalized counts, i.e. neither TPM nor RPKM.
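The effect of the normalization order can be checked on a toy example (illustrative numbers only):

```r
# Toy sample with 3 genes: raw read counts and gene lengths (in bases)
counts  <- c(geneA=100, geneB=200, geneC=300)
lengths <- c(geneA=1e3, geneB=2e3, geneC=3e3)
libSize <- sum(counts)

# RPKM: normalize by library size first, then by gene length (in kb)
rpkm <- counts / (libSize / 1e6) / (lengths / 1e3)

# TPM: normalize by gene length first, then scale so the values sum to 1e6
rpk <- counts / (lengths / 1e3)
tpm <- rpk / sum(rpk) * 1e6

sum(tpm)   # always 1e6
sum(rpkm)  # not 1e6 in general
```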
QUESTION 2) How would you qualify the distribution(s) of the log-min-max normalized RNA data of each gene? Give gene names that illustrate the type(s) of distribution you see.
Hint: You may play with the value of the random number generator seed in the chunk with plots per gene or remove it altogether and run the chunk multiple times to explore different genes.
ANSWER:
For the majority of genes, the distribution of the log-min-max normalized RNA data looks approximately normal, suggesting that the expression values downloaded from cBioPortal follow a roughly log-normal distribution (examples: BAX, MUC4, MSN). For some genes, the distribution is stacked at 0, indicating that the gene is rarely expressed in the selected samples (examples: CNBD1, FAM47C, ISX, MYOD1, SSX2).
Another popular way of visualizing high-dimensional data is to first reduce its dimension and then visualize it in the lower-dimensional representation. Principal Component Analysis (PCA) with \(k\) components finds the \(k\)-dimensional subspace that retains the most variance of the data in the original space. PCA of \(\mathbf{X}\) may be performed via Singular Value Decomposition (SVD) of the column-centered matrix.
X <- t(t(TcgaData$EXP) - rowMeans(t(TcgaData$EXP)))
resSVD <- svd(X, nu=2, nv=2)
# the scores are the principal components i.e the
# low-dimensional representations of the samples
scores <- resSVD$u %*% diag(resSVD$d[1:2])
# plot the two first principal components
plot(scores[,1],scores[,2],
col=unlist(colorsStudy[TcgaData$CLINIC$study]),
pch=16,main='PCA representation',
xlab=paste("dim1 = ",100*round(resSVD$d[1]^2/sum(resSVD$d^2),3),"%",sep = ""),
ylab=paste("dim2 = ",100*round(resSVD$d[2]^2/sum(resSVD$d^2),3),"%",sep = ""))
legend("bottomright",legend=c("blca","lusc"), pch=16, cex=0.8, col=c(colorsStudy[["blca"]], colorsStudy[["lusc"]]))
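As a sanity check of the SVD-based PCA used above, the scores can be compared (up to an arbitrary sign flip per component) with those of base R's prcomp on a small random matrix:

```r
set.seed(1)
M <- matrix(rnorm(40), nrow=8)                 # 8 samples, 5 variables
Mc <- scale(M, center=TRUE, scale=FALSE)       # center the columns
s <- svd(Mc)
scoresSVD <- s$u %*% diag(s$d)                 # principal component scores from SVD
scoresPCA <- unname(prcomp(M, center=TRUE, scale.=FALSE)$x)
# components match up to a sign flip per dimension
stopifnot(all.equal(abs(scoresSVD), abs(scoresPCA)))
```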
3.4 Your first classification model
Train/test split
We would like a model that identifies the study of origin of the RNA samples from their gene expression. We shall continue to rely on the expression of the Cancer Gene Census genes only. Here we are going to fit the logistic regression model that we saw this morning to perform this classification task.
To begin with, we shall split the data into a training set and a test set.
# split the total list of rows into train and test row lists
totalIndex <- row.names(TcgaData$CLINIC)
totalSize <- length(totalIndex)
trainProp <- 0.8
trainIndex <- sample(totalIndex, size=round(totalSize*trainProp))
testIndex <- totalIndex[!totalIndex %in% trainIndex]
# record correct labels for train/test
studyCorrectTrain <- as.matrix(TcgaData$CLINIC)[trainIndex,"study"]
studyCorrectTest <- as.matrix(TcgaData$CLINIC)[testIndex,"study"]
cat(paste("Training size:", length(trainIndex), "\nTest size:", length(testIndex)))
## Training size: 713
## Test size: 178
QUESTION 3) Why is it important to split the data into a training set and a test set?
ANSWER:
Splitting the data into a train and a test set allows us to set aside data that the model has never seen, which we can use to rigorously assess the generalization error of the model. The model error/performance on the train set is not a good assessment of how well the model generalizes, since the model may be prone to overfitting. The performance assessed on the train set is therefore often overoptimistic.
Fit the model.
The glmnet R package is a great tool for fitting all kinds of linear models. GLM stands for Generalized Linear Models, a family of models that encompasses many linear models, including linear regression (for continuous predictions), logistic regression (for class predictions), Poisson regression (for count data), Gamma regression (for log-normal data), etc. Logistic regression is a GLM from the binomial family of distributions with a logit link.
INCLASS WORK Use the glmnet R package to fit a logistic regression model on the expression data to predict the study membership of RNA samples.
# YOUR WORK HERE
fit <- glmnet(
  x=as.matrix(TcgaData$EXP[trainIndex,]),             # glmnet expects a numeric matrix
  y=as.factor(TcgaData$CLINIC[trainIndex, "study"]),  # and a factor for binomial y
  family="binomial",
  alpha=0,
  lambda=0
)
Assess the model
QUESTION 4) How do you evaluate a classification model? Give at least two metrics you think of to describe the quality of the model.
ANSWER
The most commonly used metrics for describing the quality of a binary classification model are the sensitivity (aka true positive rate or recall) and specificity (aka true negative rate). The sensitivity measures the ability of the model to call all truly positive samples, regardless of the number of false positives. The specificity measures the ability of the model to call all truly negative samples, regardless of the number of false negatives. In the ideal scenario, your model would make no false positives or false negatives and would therefore have sensitivity and specificity equal to 1. In practice, increasing sensitivity is only possible at the cost of decreasing specificity, and vice versa.
Other metrics include the precision (aka positive predictive value), which measures the proportion of true positives among all predicted positives. Precision can be combined with recall through their harmonic mean to form the F1-score.
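These metrics can all be computed from the four cells of a binary confusion matrix. Here is a small sketch (the function and argument names are ours, not part of the notebook's helpers):

```r
# Sensitivity, specificity, precision and F1 from confusion-matrix counts
classificationMetrics <- function(TP, FP, TN, FN){
  sensitivity <- TP / (TP + FN)   # true positive rate / recall
  specificity <- TN / (TN + FP)   # true negative rate
  precision   <- TP / (TP + FP)   # positive predictive value
  f1 <- 2 * precision * sensitivity / (precision + sensitivity)
  c(sensitivity=sensitivity, specificity=specificity,
    precision=precision, f1=f1)
}

round(classificationMetrics(TP=9, FP=1, TN=8, FN=2), 3)
```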
INCLASS WORK Use the predict function to get the predictions of your model on both the train and test data. Compare the predictions against the true values of the tumor type available in the TcgaData$CLINIC$study vector. Use getConfusionMatrix to quantify the quality of the model by computing the confusion matrix. You may also quantify the quality of your model using the metrics you mentioned in your previous answer.
Hint: You will need to use the as.matrix function to provide matrices to the newx argument of predict and to the arguments labelsCorrect and labelsPredicted of getConfusionMatrix. You may check beforehand if your variable is a matrix by running is.matrix(your_variable). To retain the row names when extracting the vector of correct labels, use the syntax as.matrix(your_dataframe)[your_index_selection,"your_column_name"]
# get predictions on train/test
studyPredictedTrain <- # YOUR WORK HERE
studyPredictedTest <- # YOUR WORK HERE
# get accuracies
accuracyTrain <- # YOUR WORK HERE
accuracyTest <- # YOUR WORK HERE
cat(paste("Training accuracy:", accuracyTrain, "\nTest accuracy:", accuracyTest))
# get confusion matrices
confMatTrain <- # YOUR WORK HERE
confMatTest <- # YOUR WORK HERE
# display confusion matrices
kTrain <- kable(confMatTrain, caption="Train")
kTest <- kable(confMatTest, caption="Test")
kables(list(kTrain, kTest), format="html", caption="Confusion matrices BLCA/LUSC")
studyPredictedTrain <- predict(fit, newx=as.matrix(TcgaData$EXP[trainIndex,]), type="class")
studyPredictedTest <- predict(fit, newx=as.matrix(TcgaData$EXP[testIndex,]), type="class")
# get accuracies
accuracyTrain <- round(mean(studyPredictedTrain==studyCorrectTrain),3)
accuracyTest <- round(mean(studyPredictedTest==studyCorrectTest),3)
cat(paste("Training accuracy:", accuracyTrain, "\nTest accuracy:", accuracyTest))
## Training accuracy: 1
## Test accuracy: 1
# get confusion matrices
confMatTrain <- getConfusionMatrix(studyPredictedTrain, studyCorrectTrain)
confMatTest <- getConfusionMatrix(studyPredictedTest, studyCorrectTest)
# display confusion matrices
kTrain <- kable(confMatTrain, caption="Train")
kTest <- kable(confMatTest, caption="Test")
kables(list(kTrain, kTest), format="html", caption="Confusion matrices BLCA/LUSC")
Confusion matrices BLCA/LUSC

Train (rows: predicted label, columns: correct label)

|      | blca | lusc |
|------|-----:|-----:|
| blca |  327 |    0 |
| lusc |    0 |  386 |
3.5 Another classification task
As evidenced by the performance of this simple classification model, distinguishing the tumor types BLCA and LUSC using the expression levels of known cancer genes is an easy task. Let us now move to a harder classification task: predicting the tumor stage, again using the expression levels of selected genes. The tumor stage is encoded with the TNM system devised by the American Joint Committee on Cancer (AJCC). Once the T, N, and M values of the tumor are determined, they are combined and an overall stage of 0, I, II, III, or IV is assigned. Sometimes these stages are subdivided as well, using letters such as IIIA and IIIB. Read the page https://www.facs.org/quality-programs/cancer-programs/american-joint-committee-on-cancer/cancer-staging-systems/ to learn more about this staging system.
As the biology of cancer cells is known to differ significantly from one tumor type to another, we will focus on one tumor type, namely BLCA. The task is therefore to predict the tumor stage in bladder tumors using expression data from selected genes.
Data preparation
INCLASS WORK In the TcgaData$CLINIC table, the stage is encoded in the AJCC_PATHOLOGIC_TUMOR_STAGE variable. To limit the complexity of the classification task, your first job is to simplify AJCC_PATHOLOGIC_TUMOR_STAGE by removing the substage information, recording the result in the variable SIMPLE_TUMOR_STAGE. Your second job is to further simplify the stage into a binary variable BINARY_TUMOR_STAGE containing “Early stage” for stages I and II and “Late stage” for stages III and IV.
Hint: You can use the gsub function to remove parts of a string using regex patterns. For instance, the command gsub("[X]$", "", mystring) removes the X character from the mystring variable if and only if it ends with X. In other words, if mystring is lateX then the command will return late, but if mystring is lateXes it will not modify the variable. Regexes are case-sensitive, meaning that if mystring is latex then the command will not remove the lowercase x at the end. You can add x to your matching pattern in different ways, one of which is to use the regex "[xX]$" instead of "[X]$". For your information, the gsub function can be applied to vectors.
For the transformation of values, see the example in the post https://stackoverflow.com/questions/49798684/replacing-values-in-a-column-using-a-named-list for example code of how to remap values in a column of a dataframe using a named vector.
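The two hints can be combined as follows on made-up stage strings (the named-vector lookup shown here is one of several possible remapping approaches):

```r
stages <- c("STAGE IA", "STAGE IB", "STAGE III", "STAGE IV")
# strip a trailing substage letter (case-sensitive character class)
simple <- gsub("[ABab]$", "", stages)
# remap the simplified values with a named lookup vector
map <- c("STAGE I"="Early stage", "STAGE II"="Early stage",
         "STAGE III"="Late stage", "STAGE IV"="Late stage")
unname(map[simple])
# "Early stage" "Early stage" "Late stage" "Late stage"
```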
# check the distribution of tumor stages
table(TcgaData$CLINIC$AJCC_PATHOLOGIC_TUMOR_STAGE)
# remove subtypes
TcgaData$CLINIC$SIMPLE_TUMOR_STAGE <- # YOUR WORK HERE
# check that no NAs were introduced or removed
NAs_before <- is.na(TcgaData$CLINIC$AJCC_PATHOLOGIC_TUMOR_STAGE)
NAs_after <- is.na(TcgaData$CLINIC$SIMPLE_TUMOR_STAGE)
stopifnot(all(NAs_before == NAs_after))
# binarize
TcgaData$CLINIC$BINARY_TUMOR_STAGE <- # YOUR WORK HERE
# check again that no NAs were introduced or removed
NAs_after <- is.na(TcgaData$CLINIC$BINARY_TUMOR_STAGE)
stopifnot(all(NAs_before == NAs_after))
# get a vector of colors
colorsStage <- getColors(TcgaData$CLINIC$BINARY_TUMOR_STAGE, alpha=1)
# check the distribution of tumor stages
table(TcgaData$CLINIC$AJCC_PATHOLOGIC_TUMOR_STAGE)
##
## STAGE I STAGE IA STAGE IB STAGE II STAGE IIA STAGE IIB STAGE III STAGE IIIA STAGE IIIB
## 4 85 148 131 62 93 143 61 18
## STAGE IV
## 141
# remove subtypes
TcgaData$CLINIC$SIMPLE_TUMOR_STAGE <- gsub("[A-Ca-c]$", "", TcgaData$CLINIC$AJCC_PATHOLOGIC_TUMOR_STAGE)
# check that no NAs were introduced or removed
NAs_before <- is.na(TcgaData$CLINIC$AJCC_PATHOLOGIC_TUMOR_STAGE)
NAs_after <- is.na(TcgaData$CLINIC$SIMPLE_TUMOR_STAGE)
stopifnot(all(NAs_before == NAs_after))
# named vector for simplifying into binary variable
tumor_stage_binary <- c("Early stage"="STAGE I", "Early stage"="STAGE II",
"Late stage"="STAGE III", "Late stage"="STAGE IV")
matches_tumor_stage_binary <- match(TcgaData$CLINIC$SIMPLE_TUMOR_STAGE, tumor_stage_binary)
TcgaData$CLINIC$BINARY_TUMOR_STAGE <- names(tumor_stage_binary)[matches_tumor_stage_binary]
# check again that no NAs were introduced or removed
NAs_after <- is.na(TcgaData$CLINIC$BINARY_TUMOR_STAGE)
stopifnot(all(NAs_before == NAs_after))
# get a vector of colors
colorsStage <- getColors(TcgaData$CLINIC$BINARY_TUMOR_STAGE, alpha=1)
Train/test split and tumor type selection
To begin with, we shall split the data into a training set and a test set.
INCLASS WORK Retrieve the row names (=indices) in TcgaData$CLINIC and TcgaData$EXP corresponding to patients from the study=blca study. Remove the indices of patients with NAs in the variable AJCC_PATHOLOGIC_TUMOR_STAGE, then split the selected indices into train and test index vectors as done in the previous classification task. Use a train/test ratio of 80%/20%.
# select only BLCA patients
totalIndex <- row.names(TcgaData$CLINIC)
totalBlcaIndex <- # YOUR WORK HERE
totalBlcaSize <- length(totalBlcaIndex)
# remove patients with NAs for AJCC_PATHOLOGIC_TUMOR_STAGE
NAsStageIndex <- # YOUR WORK HERE
totalBlcaIndex <- totalBlcaIndex[!totalBlcaIndex %in% NAsStageIndex]
# split selected indices into train and test
trainProp <- # YOUR WORK HERE
trainBlcaIndex <- # YOUR WORK HERE
testBlcaIndex <- # YOUR WORK HERE
# record correct labels for train/test
stageCorrectBlcaTrain <- as.matrix(TcgaData$CLINIC)[trainBlcaIndex,"BINARY_TUMOR_STAGE"]
stageCorrectBlcaTest <- as.matrix(TcgaData$CLINIC)[testBlcaIndex,"BINARY_TUMOR_STAGE"]
cat(paste("Training size:", length(trainBlcaIndex), "\nTest size:", length(testBlcaIndex)))
# select only BLCA patients
totalIndex <- row.names(TcgaData$CLINIC)
totalBlcaIndex <- totalIndex[TcgaData$CLINIC$study=="blca"]
totalBlcaSize <- length(totalBlcaIndex)
# remove patients with NAs for AJCC_PATHOLOGIC_TUMOR_STAGE
NAsStageIndex <- totalIndex[is.na(TcgaData$CLINIC$AJCC_PATHOLOGIC_TUMOR_STAGE)]
totalBlcaIndex <- totalBlcaIndex[!totalBlcaIndex %in% NAsStageIndex]
# split selected indices into train and test
trainProp <- 0.8
trainBlcaIndex <- sample(totalBlcaIndex, size=round(totalBlcaSize*trainProp))
testBlcaIndex <- totalBlcaIndex[!totalBlcaIndex %in% trainBlcaIndex]
# record correct labels for train/test
stageCorrectBlcaTrain <- as.matrix(TcgaData$CLINIC)[trainBlcaIndex,"BINARY_TUMOR_STAGE"]
stageCorrectBlcaTest <- as.matrix(TcgaData$CLINIC)[testBlcaIndex,"BINARY_TUMOR_STAGE"]
cat(paste("Training size:", length(trainBlcaIndex), "\nTest size:", length(testBlcaIndex)))
## Training size: 326
## Test size: 79
Fit and assess the model.
INCLASS WORK As done in the previous task, use the glmnet R package to fit a binomial logistic regression model on the expression data to predict the binary tumor stage of BLCA RNA samples and then assess the quality of your model.
# fit the model
fitBlca <- glmnet(
  x=# YOUR WORK HERE
  y=# YOUR WORK HERE
  family="binomial",
  alpha=0,
  lambda=0
)
# get the model predictions on the train and test splits
BlcaPredictedTrain <- # YOUR WORK HERE
BlcaPredictedTest <- # YOUR WORK HERE
# get the model train/test accuracies
accuracyBlcaTrain <- # YOUR WORK HERE
accuracyBlcaTest <- # YOUR WORK HERE
cat(paste("Training accuracy:", accuracyBlcaTrain, "\nTest accuracy:", accuracyBlcaTest))
# get the confusion matrices
confMatBlcaTrain <- getConfusionMatrix(BlcaPredictedTrain, stageCorrectBlcaTrain)
confMatBlcaTest <- getConfusionMatrix(BlcaPredictedTest, stageCorrectBlcaTest)
# display the confusion matrices
kBlcaTrain <- kable(confMatBlcaTrain, caption="Train")
kBlcaTest <- kable(confMatBlcaTest, caption="Test")
kables(list(kBlcaTrain, kBlcaTest), format="html", caption="Confusion matrices BLCA tumor stage")
# fit the model
fitBlca <- glmnet(
  x=as.matrix(TcgaData$EXP[trainBlcaIndex,]),
  y=as.factor(as.matrix(TcgaData$CLINIC)[trainBlcaIndex, "BINARY_TUMOR_STAGE"]),
  family="binomial",
  alpha=0,
  lambda=0
)
# get the model predictions on the train and test splits
BlcaPredictedTrain <- predict(fitBlca, newx=as.matrix(TcgaData$EXP[trainBlcaIndex,]), type="class")
BlcaPredictedTest <- predict(fitBlca, newx=as.matrix(TcgaData$EXP[testBlcaIndex,]), type="class")
# get the model train/test accuracies
accuracyBlcaTrain <- round(mean(BlcaPredictedTrain==stageCorrectBlcaTrain), 3)
accuracyBlcaTest <- round(mean(BlcaPredictedTest==stageCorrectBlcaTest), 3)
cat(paste("Training accuracy:", accuracyBlcaTrain, "\nTest accuracy:", accuracyBlcaTest))
## Training accuracy: 1
## Test accuracy: 0.684
# get the confusion matrices
confMatBlcaTrain <- getConfusionMatrix(BlcaPredictedTrain, stageCorrectBlcaTrain)
confMatBlcaTest <- getConfusionMatrix(BlcaPredictedTest, stageCorrectBlcaTest)
# display the confusion matrices
kBlcaTrain <- kable(confMatBlcaTrain, caption="Train")
kBlcaTest <- kable(confMatBlcaTest, caption="Test")
kables(list(kBlcaTrain, kBlcaTest), format="html", caption="Confusion matrices BLCA tumor stage")
Confusion matrices BLCA tumor stage (rows: predicted label, columns: correct label)

Train

|             | Early stage | Late stage |
|-------------|------------:|-----------:|
| Early stage |         104 |          0 |
| Late stage  |           0 |        222 |

Test

|             | Early stage | Late stage |
|-------------|------------:|-----------:|
| Early stage |          15 |         13 |
| Late stage  |          12 |         39 |
QUESTION 5) Comment on the difference of model performance between the train and test splits. How is this situation described in machine learning terms? How do you explain this situation and can you think of a strategy to overcome it?
ANSWER:
There is a large difference between the performance of the model on the train and test sets. The model is very capable of explaining data it has already seen, but does not generalize as well to new data. This situation is known as “overfitting” in machine learning terms. Overfitting is often caused by an imbalance between the number of parameters of the model and the number of observations. This problem is particularly acute in computational oncology, where we have a limited number of patients for whom high-dimensional data is collected. To limit overfitting, a good strategy is to regularize the model. Another strategy is to rely on expert knowledge to select a reasonable number of predictors compared to the number of observations.
Understand the model coefficients
The medical doctor you are working with is not very satisfied with your model and would like to know the most discriminative genes, to see if we could think of a way to improve the model.
INCLASS WORK Extract the coefficients of the model using coefficients(fitBlca), then store the coefficients contributing to predicting “Early stage” and the coefficients contributing to predicting “Late stage” into two distinct objects. The sign of the coefficients lets you distinguish the two. We will then visualize the values of the coefficients assigned to these genes by the model.
Hint: Observe that when fitting the model, we provided the column of labels as a factor variable using the as.factor function. This factor column is then converted into numeric values internally to fit the model. In a binary classification task, negative coefficients contribute to predicting the “0” class while positive coefficients contribute to predicting the “1” class. You can check the order of the levels of a factor variable by running levels(your_factor_variable).
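A quick, self-contained way to see which class ends up as the “1” class (independent of the TCGA data):

```r
f <- as.factor(c("Late stage", "Early stage", "Late stage"))
levels(f)
# "Early stage" "Late stage" : factor levels are sorted alphabetically,
# so "Early stage" is the "0" class and positive coefficients push
# predictions towards "Late stage"
```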
# extract coefficients and remove the intercept
fitBlcaCoefs <- # YOUR WORK HERE
fitBlcaCoefs <- fitBlcaCoefs[!grepl("(Intercept)", rownames(fitBlcaCoefs)),,drop=F]
# select coefficients predicting towards Early stage and sort them
EarlyGenes <- # YOUR WORK HERE
EarlyGenes <- EarlyGenes[order(abs(EarlyGenes), decreasing=T),,drop=F]
# select coefficients predicting towards Late stage and sort them
LateGenes <- # YOUR WORK HERE
LateGenes <- LateGenes[order(abs(LateGenes), decreasing=T),,drop=F]
# extract coefficients and remove the intercept
fitBlcaCoefs <- as.matrix(coefficients(fitBlca))
fitBlcaCoefs <- fitBlcaCoefs[!grepl("(Intercept)", rownames(fitBlcaCoefs)),,drop=F]
# select coefficients predicting towards Early stage and sort them
EarlyGenes <- fitBlcaCoefs[fitBlcaCoefs < 0,,drop=F]
EarlyGenes <- EarlyGenes[order(abs(EarlyGenes), decreasing=T),,drop=F]
# select coefficients predicting towards Late stage and sort them
LateGenes <- fitBlcaCoefs[fitBlcaCoefs > 0,,drop=F]
LateGenes <- LateGenes[order(abs(LateGenes), decreasing=T),,drop=F]
We shall now visualize the 20 most discriminative genes for each label.
plot_coefs_logistic_regression(nCoeffsPlot=20, CoeffsA=LateGenes, CoeffsB=EarlyGenes,
                               colorA=colorsStage[["Late stage"]], colorB=colorsStage[["Early stage"]])
INCLASS WORK For the two most discriminative genes in favor of “Early stage” and the two most discriminative genes in favor of “Late stage”, give the average value and standard deviation of the normalized expression observed in BLCA Early stage and BLCA Late stage RNA samples.
TopGenesEarly <- # YOUR WORK HERE
TopGenesLate <- # YOUR WORK HERE
for (gene in c(TopGenesEarly, TopGenesLate)){
  for (stage in c("Early stage", "Late stage")){
    StageIndex <- totalIndex[TcgaData$CLINIC$BINARY_TUMOR_STAGE==stage]
    BlcaStageIndex <- totalBlcaIndex[totalBlcaIndex %in% StageIndex]
    # get the mean and standard deviation of expression for the selected gene ("gene" variable)
    # and for the BLCA patients from the selected stage ("stage" variable). Round values to 3 digits.
    m <- # YOUR WORK HERE
    s <- # YOUR WORK HERE
    print(paste("-the mean and std of gene", gene, "in BLCA", stage, "are", m, "and", s))
  }
}
par(mfrow=c(2,2), mai=c(0.5,0.75,0.5,0.5))
for (geneHist in c(TopGenesEarly, TopGenesLate)){
  EarlyStageIndex <- totalIndex[TcgaData$CLINIC$BINARY_TUMOR_STAGE=="Early stage"]
  LateStageIndex <- totalIndex[TcgaData$CLINIC$BINARY_TUMOR_STAGE=="Late stage"]
  BlcaEarlyStageIndex <- totalBlcaIndex[totalBlcaIndex %in% EarlyStageIndex]
  BlcaLateStageIndex <- totalBlcaIndex[totalBlcaIndex %in% LateStageIndex]
  DataGene <- rbind(data.frame(Expression=TcgaData$EXP[BlcaEarlyStageIndex,geneHist], Stage="Early stage"),
                    data.frame(Expression=TcgaData$EXP[BlcaLateStageIndex,geneHist], Stage="Late stage"))
  pirateplot(formula=Expression ~ Stage,
             data=DataGene,
             theme=1,
             quant=NULL,
             pal=colorsStage,
             main=geneHist,
             xlab="",
             ylab="")
}
TopGenesEarly <- rownames(EarlyGenes)[1:2]
TopGenesLate <- rownames(LateGenes)[1:2]
for (gene in c(TopGenesEarly, TopGenesLate)){
  for (stage in c("Early stage", "Late stage")){
    StageIndex <- totalIndex[TcgaData$CLINIC$BINARY_TUMOR_STAGE==stage]
    BlcaStageIndex <- totalBlcaIndex[totalBlcaIndex %in% StageIndex]
    # get the mean and standard deviation of expression for the selected gene ("gene" variable)
    # and for the BLCA patients from the selected stage ("stage" variable). Round values to 3 digits.
    m <- round(mean(TcgaData$EXP[BlcaStageIndex, gene]), 3)
    s <- round(sd(TcgaData$EXP[BlcaStageIndex, gene]), 3)
    print(paste("-the mean and std of gene", gene, "in BLCA", stage, "are", m, "and", s))
  }
}
## [1] "-the mean and std of gene CHD4 in BLCA Early stage are 0.612 and 0.017"
## [1] "-the mean and std of gene CHD4 in BLCA Late stage are 0.612 and 0.018"
## [1] "-the mean and std of gene SRSF2 in BLCA Early stage are 0.592 and 0.018"
## [1] "-the mean and std of gene SRSF2 in BLCA Late stage are 0.587 and 0.018"
## [1] "-the mean and std of gene ABL1 in BLCA Early stage are 0.512 and 0.03"
## [1] "-the mean and std of gene ABL1 in BLCA Late stage are 0.521 and 0.03"
## [1] "-the mean and std of gene ERCC5 in BLCA Early stage are 0.501 and 0.025"
## [1] "-the mean and std of gene ERCC5 in BLCA Late stage are 0.498 and 0.024"
par(mfrow=c(2,2), mai=c(0.5,0.75,0.5,0.5))
for (geneHist in c(TopGenesEarly, TopGenesLate)){
  EarlyStageIndex <- totalIndex[TcgaData$CLINIC$BINARY_TUMOR_STAGE=="Early stage"]
  LateStageIndex <- totalIndex[TcgaData$CLINIC$BINARY_TUMOR_STAGE=="Late stage"]
  BlcaEarlyStageIndex <- totalBlcaIndex[totalBlcaIndex %in% EarlyStageIndex]
  BlcaLateStageIndex <- totalBlcaIndex[totalBlcaIndex %in% LateStageIndex]
  DataGene <- rbind(data.frame(Expression=TcgaData$EXP[BlcaEarlyStageIndex,geneHist], Stage="Early stage"),
                    data.frame(Expression=TcgaData$EXP[BlcaLateStageIndex,geneHist], Stage="Late stage"))
  pirateplot(formula=Expression ~ Stage,
             data=DataGene,
             theme=1,
             quant=NULL,
             pal=colorsStage,
             main=geneHist,
             xlab="",
             ylab="")
}
Improve the model
The confusion matrices of the binomial logistic regression model we trained show that predicting the tumor stage from expression data is not an easy task, and/or that our modeling strategy can be improved. In our setting, we have more predictors (i.e. features) than observations, a situation known as “p > n” in machine learning terms. In order to guide the model towards a solution involving a smaller number of predictors, we will use regularization techniques that impose constraints on the scale of the coefficients and/or the number of non-zero coefficients.
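To see concretely what the “p > n” situation implies, here is a small base-R sketch (using made-up random data, not the TCGA tables) showing that an unregularized linear model cannot even estimate all of its coefficients when predictors outnumber observations:

```r
set.seed(1)
n <- 10   # observations
p <- 50   # predictors: p > n
toy <- data.frame(matrix(rnorm(n * p), nrow=n))
toy$y <- rnorm(n)
# ordinary least squares leaves most coefficients undetermined:
# with n=10 rows, only 10 of the 51 coefficients (intercept included)
# are estimable; lm() reports the other 41 as NA
fit <- lm(y ~ ., data=toy)
sum(is.na(coef(fit)))
```

Regularization resolves this indeterminacy by adding a penalty that selects one particular (small-coefficient) solution among the infinitely many that fit the training data.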
An established procedure to penalize models with excessively large coefficient values is lasso regularization, which penalizes the sum of the absolute values of the coefficients and thereby drives many of them to exactly zero. The lasso (like other regularization procedures) comes with an additional parameter to optimize, called the regularization parameter \(\lambda\). The recommended way to find \(\lambda\) is with a k-fold cross-validation strategy.
As a reminder, below is the objective function that is being minimized in order to find the best coefficients for the model
\[\mathcal{L}_{\text{log}} + \lambda \sum_{j=1}^p |\beta_j|\]
where \(p\) is the number of variables and \(\beta_1, \ldots, \beta_p\) are the model’s coefficients.
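As a toy illustration of how the lasso (L1) penalty differs from the ridge (L2) penalty used in other variants of this objective, we can compare the two penalty terms on a small, arbitrary coefficient vector:

```r
beta <- c(0.5, -2, 0, 1)
lambda <- 0.1
penalty_l1 <- lambda * sum(abs(beta))  # lasso penalty: 0.1 * 3.5
penalty_l2 <- lambda * sum(beta^2)     # ridge penalty: 0.1 * 5.25
```

Unlike the squared (ridge) penalty, the absolute-value penalty is not differentiable at zero, which is what allows the lasso to set coefficients exactly to zero and hence perform variable selection.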
INCLASS WORK: Use the cv.glmnet function from glmnet to fit a lasso logistic regression model to predict the binary tumor stage in BLCA patients. The cross-validation is here to help us find an optimal value of \(\lambda\).
Hint: Check the documentation of the glmnet function to understand what the alpha and lambda parameters are; check the documentation of the cv.glmnet function to understand what the nfolds parameter is.
# fit the model
fitBlcaCv <- cv.glmnet(
  # YOUR WORK HERE
)
plot(fitBlcaCv)
print(paste("Your best value of lambda is:", fitBlcaCv$lambda.min))
# fit the model
fitBlcaCv <- cv.glmnet(
  x=as.matrix(TcgaData$EXP[trainBlcaIndex,]),
  y=as.factor(as.matrix(TcgaData$CLINIC)[trainBlcaIndex, "BINARY_TUMOR_STAGE"]),
  family="binomial",
  alpha=1,
  nfolds=10
)
plot(fitBlcaCv)

print(paste("Your best value of lambda is:", fitBlcaCv$lambda.min))
## [1] "Your best value of lambda is: 0.0549466509328618"
The value of the regularization parameter (often called \(\lambda\)) depends on the absolute scale of your predictors. The larger this scale is relative to the scale of the likelihood function, the smaller the regularization parameter should be to balance against the relative importance of the likelihood function.
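This scale dependence can be illustrated with a hand-rolled, one-dimensional penalized least-squares estimator in base R (for illustration only; note that glmnet standardizes predictors internally by default, which is why we did not have to worry about this above):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
# closed-form 1D penalized least squares: beta = sum(x*y) / (sum(x^2) + lambda)
penalized_beta <- function(x, y, lambda) sum(x * y) / (sum(x^2) + lambda)
# shrinkage = penalized estimate relative to the unpenalized (lambda = 0) estimate
shrink_raw    <- penalized_beta(x, y, 50) / penalized_beta(x, y, 0)
# multiplying the predictor by 10 makes the same lambda shrink far less,
# because sum(x^2) grows by a factor of 100 while lambda stays fixed
shrink_scaled <- penalized_beta(10 * x, y, 50) / penalized_beta(10 * x, y, 0)
```

In other words, a \(\lambda\) tuned on one scale of the data is not transferable to another scale, which is one reason cross-validation is run on the same (standardized) design matrix as the final fit.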
INCLASS WORK: Rerun the lasso logistic regression using the optimal value of \(\lambda\) and evaluate its quality. Show again the coefficient values of the top 20 most discriminative genes for both “Early stage” and “Late stage.”
# logistic regression fit
fitBlcaReg <- glmnet(
  # YOUR WORK HERE
)
# get the model predictions on the train and test splits
BlcaPredictedTrain <- # YOUR WORK HERE
BlcaPredictedTest <- # YOUR WORK HERE
# get the model train/test accuracies
accuracyBlcaTrain <- # YOUR WORK HERE
accuracyBlcaTest <- # YOUR WORK HERE
cat(paste("Training accuracy:", accuracyBlcaTrain, "\nTest accuracy:", accuracyBlcaTest))
# get the confusion matrices
confMatBlcaTrain <- # YOUR WORK HERE
confMatBlcaTest <- # YOUR WORK HERE
# display the confusion matrices
kBlcaTrain <- kable(confMatBlcaTrain, caption="Train")
kBlcaTest <- kable(confMatBlcaTest, caption="Test")
kables(list(kBlcaTrain, kBlcaTest), format="html", caption="Confusion matrices BLCA tumor stage")
# logistic regression fit
fitBlcaReg <- glmnet(
  x=as.matrix(TcgaData$EXP[trainBlcaIndex,]),
  y=as.factor(as.matrix(TcgaData$CLINIC)[trainBlcaIndex, "BINARY_TUMOR_STAGE"]),
  family="binomial",
  alpha=1,
  lambda=fitBlcaCv$lambda.min
)
# get the model predictions on the train and test splits
BlcaPredictedTrain <- predict(fitBlcaReg, newx=as.matrix(TcgaData$EXP[trainBlcaIndex,]), type="class")
BlcaPredictedTest <- predict(fitBlcaReg, newx=as.matrix(TcgaData$EXP[testBlcaIndex,]), type="class")
# get the model train/test accuracies
accuracyBlcaTrain <- round(mean(BlcaPredictedTrain==stageCorrectBlcaTrain), 3)
accuracyBlcaTest <- round(mean(BlcaPredictedTest==stageCorrectBlcaTest), 3)
cat(paste("Training accuracy:", accuracyBlcaTrain, "\nTest accuracy:", accuracyBlcaTest))
## Training accuracy: 0.755
## Test accuracy: 0.696
# get the confusion matrices
confMatBlcaTrain <- getConfusionMatrix(BlcaPredictedTrain, stageCorrectBlcaTrain)
confMatBlcaTest <- getConfusionMatrix(BlcaPredictedTest, stageCorrectBlcaTest)
# display the confusion matrices
kBlcaTrain <- kable(confMatBlcaTrain, caption="Train")
kBlcaTest <- kable(confMatBlcaTest, caption="Test")
kables(list(kBlcaTrain, kBlcaTest), format="html", caption="Confusion matrices BLCA tumor stage")
Confusion matrices BLCA tumor stage

Train

|             | Late stage | Early stage |
|-------------|-----------:|------------:|
| Late stage  |        205 |          63 |
| Early stage |         17 |          41 |

Test

|             | Late stage | Early stage |
|-------------|-----------:|------------:|
| Late stage  |         48 |          20 |
| Early stage |          4 |           7 |
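As a sanity check, the training accuracy reported above can be recomputed directly from the train confusion matrix: accuracy is the sum of the diagonal (correct predictions) divided by the total number of patients.

```r
# counts taken from the train confusion matrix above (rows/columns: Late stage, Early stage)
confTrain <- matrix(c(205, 17, 63, 41), nrow=2,
                    dimnames=list(c("Late stage", "Early stage"),
                                  c("Late stage", "Early stage")))
# (205 + 41) / 326 recovers the 0.755 training accuracy printed earlier
accTrain <- sum(diag(confTrain)) / sum(confTrain)
round(accTrain, 3)
```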
# extract coefficients and remove the intercept
fitBlcaRegCoefs <- as.matrix(coefficients(fitBlcaReg))
fitBlcaRegCoefs <- fitBlcaRegCoefs[!grepl("(Intercept)", rownames(fitBlcaRegCoefs)),,drop=F]
# select coefficients predicting towards Early stage and sort them
EarlyRegGenes <- fitBlcaRegCoefs[fitBlcaRegCoefs < 0,,drop=F]
EarlyRegGenes <- EarlyRegGenes[order(abs(EarlyRegGenes), decreasing=T),,drop=F]
# select coefficients predicting towards Late stage and sort them
LateRegGenes <- fitBlcaRegCoefs[fitBlcaRegCoefs > 0,,drop=F]
LateRegGenes <- LateRegGenes[order(abs(LateRegGenes), decreasing=T),,drop=F]
plot_coefs_logistic_regression(nCoeffsPlot=20, CoeffsA=LateRegGenes, CoeffsB=EarlyRegGenes,
                               colorA=colorsStage[["Early stage"]], colorB=colorsStage[["Late stage"]])

QUESTION 6) Have you been able to overcome the situation described previously? Has Lasso regularization changed the accuracy of your model? Are you surprised?
ANSWER:
The lasso model reaches an accuracy comparable to that of our first model, while using far fewer predictors. This is not very surprising on this classification task: predicting tumor stage from expression data is genuinely difficult, and the main benefit of the lasso here is a much sparser, more interpretable model rather than a large accuracy gain. In more complex settings, using the lasso may help reduce overfitting and therefore decrease the generalization error. The choice of using regularization or not is very task-specific.
QUESTION 7) If you go to see a clinician and tell them about your new model, do you think they will want to use it? Check the expression profiles of the top 2 most discriminative genes in each direction, as done previously, to support your answer.
Hint: You should reuse and adapt code from previous “inclass work.”
ANSWER:
TopRegGenesEarly <- rownames(EarlyRegGenes)[1:2]
TopRegGenesLate <- rownames(LateRegGenes)[1:2]
TopRegGenesEarly <- setdiff(TopRegGenesEarly, NA)
TopRegGenesLate <- setdiff(TopRegGenesLate, NA)
for (gene in c(TopRegGenesEarly, TopRegGenesLate)){
  for (stage in c("Early stage", "Late stage")){
    StageIndex <- totalIndex[TcgaData$CLINIC$BINARY_TUMOR_STAGE==stage]
    BlcaStageIndex <- totalBlcaIndex[totalBlcaIndex %in% StageIndex]
    # get the mean and standard deviation of gene expression for the selected gene ("gene" variable)
    # and for the BLCA patients from the selected stage ("stage" variable). Round values to 3 digits.
    m <- round(mean(TcgaData$EXP[BlcaStageIndex, gene]), 3)
    s <- round(sd(TcgaData$EXP[BlcaStageIndex, gene]), 3)
    print(paste("-the mean and std of gene", gene, "in BLCA", stage, "are", m, "and", s))
  }
}
## [1] "-the mean and std of gene ETV6 in BLCA Early stage are 0.516 and 0.033"
## [1] "-the mean and std of gene ETV6 in BLCA Late stage are 0.502 and 0.034"
## [1] "-the mean and std of gene SRGAP3 in BLCA Early stage are 0.376 and 0.074"
## [1] "-the mean and std of gene SRGAP3 in BLCA Late stage are 0.344 and 0.077"
## [1] "-the mean and std of gene RPN1 in BLCA Early stage are 0.63 and 0.023"
## [1] "-the mean and std of gene RPN1 in BLCA Late stage are 0.639 and 0.02"
## [1] "-the mean and std of gene OMD in BLCA Early stage are 0.077 and 0.087"
## [1] "-the mean and std of gene OMD in BLCA Late stage are 0.163 and 0.117"
par(mfrow=c(2,2), mai=c(0.5,0.75,0.5,0.5))
for (geneHist in c(TopRegGenesEarly, TopRegGenesLate)){
  EarlyStageIndex <- totalIndex[TcgaData$CLINIC$BINARY_TUMOR_STAGE=="Early stage"]
  LateStageIndex <- totalIndex[TcgaData$CLINIC$BINARY_TUMOR_STAGE=="Late stage"]
  BlcaEarlyStageIndex <- totalBlcaIndex[totalBlcaIndex %in% EarlyStageIndex]
  BlcaLateStageIndex <- totalBlcaIndex[totalBlcaIndex %in% LateStageIndex]
  DataGene <- rbind(data.frame(Expression=TcgaData$EXP[BlcaEarlyStageIndex,geneHist], Stage="Early stage"),
                    data.frame(Expression=TcgaData$EXP[BlcaLateStageIndex,geneHist], Stage="Late stage"))
  pirateplot(formula=Expression ~ Stage,
             data=DataGene,
             theme=1,
             quant=NULL,
             pal=colorsStage,
             main=geneHist,
             xlab="",
             ylab="")
}
Firstly, as seen above, the reduced number of non-zero coefficients makes the model much easier to interpret. Secondly, the differences in mean and standard deviation of the expression of the top “Early stage” and “Late stage” genes selected by the lasso look slightly more convincing.
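One way to quantify “slightly more convincing” is a standardized mean difference (Cohen's d with a pooled standard deviation; the helper below is defined here for illustration and assumes roughly equal group sizes, which is only approximately true for these cohorts), computed from the summary statistics printed above. CHD4 was the top gene of the unregularized model, while OMD is a top lasso-selected gene:

```r
# standardized mean difference between Early and Late stage, from (mean, sd) pairs above
cohens_d <- function(m1, s1, m2, s2) abs(m1 - m2) / sqrt((s1^2 + s2^2) / 2)
d_CHD4 <- cohens_d(0.612, 0.017, 0.612, 0.018)  # identical means: d = 0
d_OMD  <- cohens_d(0.077, 0.087, 0.163, 0.117)  # clearly separated means
round(c(CHD4=d_CHD4, OMD=d_OMD), 2)
```

The effect size for OMD is an order of magnitude larger, which supports the claim that the lasso-selected genes separate the two stages better than the top genes of the unregularized model.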